Finding Structure and Characteristics of Web Documents for Classification

Authors

Wai-ching Wong

Ada Wai-Chee Fu

Abstract

Many web documents that contain the same type of information have a similar structure. In this paper, we examine the problem of finding the structure of web documents and present a hierarchical structure to represent the relations among text data in those documents. Because the standards for web page publishing are loose, different authors can use different wordings (labels) for the same information. We introduce a label discovery algorithm that uses the hierarchical structure extracted from the web pages. The algorithm discovers similar labels that describe the same kind of information, and such labels help us find the structure of the web documents. Experiments show that the algorithm can successfully discover similar labels and that the structure obtained by our method can distinguish web pages accurately.
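The abstract does not spell out how the hierarchical structure is extracted, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that HTML heading tags (`h1` to `h6`) act as labels and that nesting follows heading level; the names `OutlineParser` and `label_paths` are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect heading entries, treating h1..h6 as labels (an assumption)."""

    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.entries = []        # dicts with keys: level, label, text
        self._in_heading = None  # level of the heading currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = self.HEADINGS[tag]
            self.entries.append({"level": self._in_heading, "label": "", "text": ""})

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = None

    def handle_data(self, data):
        data = data.strip()
        if not data or not self.entries:
            return  # ignore whitespace and text before the first heading
        key = "label" if self._in_heading else "text"
        entry = self.entries[-1]
        entry[key] = (entry[key] + " " + data).strip()


def label_paths(html):
    """Return, for each heading, the tuple of labels from the root down to it."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    stack, paths = [], []
    for entry in parser.entries:
        # Drop labels at the same or a deeper level before descending.
        while stack and stack[-1][0] >= entry["level"]:
            stack.pop()
        stack.append((entry["level"], entry["label"]))
        paths.append(tuple(label for _, label in stack))
    return paths


page = (
    "<h1>Jane Doe</h1>"
    "<h2>Education</h2><p>BSc</p>"
    "<h2>Work Experience</h2><p>Acme</p>"
)
print(label_paths(page))
```

Two pages of the same type could then be compared by their label paths. A second page that uses "Employment" where this one uses "Work Experience" would place both labels at the same position in the hierarchy, which is the kind of evidence a label discovery step could use to treat them as similar.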

Related resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

Optimizing the Pre-Processing Phase of Automatic e-Document Classification

Electronic documents such as e-catalogs, e-mails, and Web documents have their own distinct characteristics that can be utilized in search and classification. They are structured, noisy, and, in some cases, related to each other. We analyze the characteristics of three major types of e-documents (e-catalogs, e-mails, and Web documents) and propose methods for optimizing automatic classification o...

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...

Publication date: 2000
